Entity Identification and/or Association Using Multiple Data Elements

ABSTRACT

Data values from a plurality of data elements can be combined to form one or more entity identifiers to facilitate identifications of and/or associations among a plurality of data records representing one or more entities. Associated data records can represent the same entity and/or multiple entities that can be properly associated. Associations can be made among two or more unique entities and/or their respective representative data records if they correspond to substantially the same entity identifier. In one embodiment, the number, type, and/or characteristics of values for data elements used to form an entity identifier can be selected so that the entity identifier is substantially statistically unique.

RELATED APPLICATIONS

This patent application is a continuation of and claims the benefit of priority from U.S. Nonprovisional patent application Ser. No. 11/818,908, filed Jun. 14, 2007, which is a nonprovisional of and claims the benefit of priority from U.S. Provisional Patent Application No. 60/813,792, filed Jun. 14, 2006, both of which are hereby incorporated by reference in their entirety.

COPYRIGHT NOTICE

©2007 TransUnion TeleData, LLC. A portion of the disclosure of this patent document contains material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever. 37 CFR §1.71(d), (e).

TECHNICAL FIELD

Embodiments described in the present application relate to the field of identification and/or association of one or more data records representing entities within one or more data sources.

BACKGROUND

Many data sources, such as commercial data repositories, utility company customer databases, etc., to list only a few examples, store data records corresponding to individual entities, such as people, companies, etc. The data records are typically comprised of multiple data elements, and the value for each data element typically represents a particular aspect of the entity's identity, or other information related to the entity. Numerous commercial and noncommercial enterprises employ such data sources in a variety of ways as an integral part of their product or service offerings and daily operations.

Unfortunately, given the potentially vast array of records a data source can include, it often proves to be a challenge to search, analyze, and/or manipulate the entity-representing data in a meaningful way. Furthermore, some data sources contain inaccurate or outdated information. For example, even using a well-indexed data source, it often can be difficult to identify with sufficient certainty that one or more particular records actually correspond to the specific entity they putatively represent. It can also be difficult to identify associations between multiple seemingly independent entity data records. Due to variations in the type, amount, and structure of data elements each data source can employ for its respective data records, the challenges of identifying and associating individual entities can be greatly magnified if multiple data sources are employed.

SUMMARY

Embodiments consistent with the present application can utilize, at least in part, entity data records comprising multiple data elements to facilitate identification of entities and/or associations being made among entities represented by the data records. The data records can originate from and/or be maintained within one or more data sources. Such embodiments can combine data values from a plurality of data elements to form an entity identifier, which can serve, at least in part, as a key for facilitating the identification of and/or associations among entities represented by a plurality of data records.

In one embodiment, the entity identifier can facilitate the identification of one or more data records corresponding to a unique entity, from one or more data sources. In addition or in the alternative, an embodiment can facilitate the association of multiple data records that represent the same entity, and/or multiple data records representing separate, associated entities. For example, associations can be made among two or more unique entities and/or their respective representative data records if they correspond to substantially the same entity identifier. In one embodiment, the number, type, and/or characteristics of data elements used to form an entity identifier can be selected so that the entity identifier is expected to represent an individual entity and/or associated entities with at least a predetermined confidence level.

Additional aspects and advantages will be apparent from the following detailed description of preferred embodiments, which proceeds with reference to the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates a system in accordance with one embodiment.

FIG. 2 presents one embodiment of a process flow diagram consistent with the claimed subject matter.

FIG. 3 conceptually illustrates associations among entities using entity identifiers in accordance with one embodiment.

FIG. 4 presents a second embodiment of a process flow diagram consistent with the claimed subject matter.

DETAILED DESCRIPTION

Embodiments consistent with the present application can be implemented as systems, apparatuses, methods, and/or other implementations of subject matter for combining a plurality of data element values from data records originating from one or more data sources to form an entity identifier that can be used, at least in part, to facilitate identification of and/or associations among entities, as represented by the data records. In one embodiment, data elements selected to form an entity identifier can be selected, at least in part, so that their respective data values can be combined and/or otherwise employed to form an entity identifier that can be substantially statistically unique. Employing an entity identifier that is substantially statistically unique can facilitate identifications and/or associations being made with confidence levels that are appropriately high for a given application. As used throughout this application and the attached claims, the term “confidence level” corresponds to a probability that an identified association does not represent a false positive association. Thus, the higher the confidence level, the less likely it is that data records will be erroneously associated with one another.

One advantage of embodiments consistent with the claimed subject matter is the ability to tailor the extent to which an entity identifier is statistically unique, which can correspondingly yield appropriately tailored confidence levels in the associations made among data records. The degree of uniqueness can be selected to suit the particular application, field of use, and/or implementation in which the entity identifier is to be employed. In certain embodiments and/or implementations, a high confidence level in the identification and/or association results can be desirable. In such instances a more statistically unique entity identifier can be used. In alternative embodiments and/or implementations, lower confidence in the identification and/or association of unique entities can be acceptable. In such instances an entity identifier that is less statistically unique can be employed. The phrase “substantially statistically unique” is employed herein consistent with the above-described notions of flexibility, scalability, and customization. Two different entity identifiers can both be considered substantially statistically unique, even if one is more statistically unique than the other. One or more embodiments can require that an entity identifier be at least statistically unique enough to provide meaningful and/or useful results for a given implementation.

In an implementation for which a relatively low confidence level in the association results is acceptable, a relatively wider variety of data elements can be selected to form an entity identifier. Such implementations can employ data elements having values that are not highly unique to individual entities, such as a name or date of birth, as but two examples. In a large dataset, there can be multiple data records representing several different entities with the same name and date of birth. However, in an implementation that requires entities to be identified and/or associated with a high degree of confidence, data elements can be selected so as to form entity identifiers that can indicate associations with a confidence level that is sufficiently high for a particular intended application. As disclosed in more detail below, customization through varying the selection and/or number of data elements employed to form entity identifiers, as well as other factors, can allow the accuracy and/or reliability of operations performed on the data records to be tailored, at least in part, in accordance with the requirements of each specific implementation.

For purposes of facilitating discussion, and not by way of limitation on the claimed subject matter, one example of an entity identifier embodiment, presented for illustrative purposes, can be formed using data elements representing components of present and/or historic contact information and/or other identifying data stored in data records representing individual entities. For example, such data elements can include values representing, in whole or in part, address, phone number, e-mail handle, and/or other data representing an entity, to name but a few examples. An entity identifier embodiment can be formed from values for these and/or other contact information data elements and can be employed, at least in part, to facilitate identification of and/or an association among one or more entities as disclosed in more detail below.

In one or more embodiments, the statistical uniqueness of an entity identifier can be improved by selecting data elements to form the entity identifier that have values that are as evenly distributed across the population of entities as possible and/or practicable. In one embodiment, even distribution of values can be reasonably achieved by selecting data elements having values that are believed to be substantially randomly assigned to the entities represented by the data records. The concept of even distribution of values can be illustrated graphically as a histogram, frequency diagram, and/or other suitable depiction graphing the range of possible values for a selected data element against the number of instances of each possible value occurring in a given set of data records. Data elements having a distribution of values that graphs relatively flat, rather than as a bell curve, can facilitate the formation of entity identifiers that are expected to identify associations among the data records with increased confidence.

To illustrate the above point, a data element storing values for the last four digits of a contact phone number would facilitate formation of an entity identifier that would yield higher confidence results than would an entity identifier formed from values for a data element storing zip code data. This is because zip codes, as in the example of the United States postal system designation, are not randomly assigned to, or evenly distributed among, entities. Rather, they are assigned based on entity location within geographic groupings. In comparison, the last four digits of an entity's contact phone number more closely approximate a random distribution throughout the population of entities. However, using a data element storing full telephone number values decreases the evenness and/or reasonable randomness of the value distribution, as telephone number area codes and prefixes are assigned based at least in part on geographic grouping. Similarly, the values for the last four digits of a nine-digit U.S. Social Security Number are relatively evenly and/or reasonably randomly assigned throughout the population of U.S. individuals, while the values for the first three and middle two digits exhibit grouping characteristics. Data records that have multiple common values for data elements having reasonably random and/or evenly distributed values can be more confidently associated with one another than can data records that have multiple common values for data elements having values with grouped distribution among the entities. Associating data records based on commonalities in poorly distributed values can lead to false positive associations. For example, it is possible that two data records can coincidentally contain two or more matching historical zip code values even though the data records represent neither the same entity nor entities that should be properly associated (such as family members, spouses, non-familial cohabitants, etc.).

By obtaining knowledge of the content and/or characteristics of the data records and data elements included therein, specific data elements can be selected so as to form entity identifiers that can be used to associate data records with sufficiently high confidence and reduced instances of false positive associations.
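
As a rough illustration of comparing how evenly two candidate data elements are distributed, the following sketch tallies value frequencies, as a histogram would, and reports the share of records holding the single most common value. The function name `top_value_share`, the toy records, and the choice of this particular measure are assumptions made for illustration only; the description above does not prescribe any specific statistic.

```python
from collections import Counter

def top_value_share(values):
    """Return the fraction of records that hold the most common value.

    A flatter (more even) distribution yields a smaller share.
    This measure is an illustrative assumption, not a prescribed test.
    """
    counts = Counter(values)
    return max(counts.values()) / len(values)

# Hypothetical toy records: (zip code, last four digits of a phone number).
records = [
    ("60601", "4821"), ("60601", "0937"), ("60601", "7712"),
    ("60601", "3390"), ("60614", "5106"), ("60614", "2248"),
]

zip_share = top_value_share([zip_code for zip_code, _ in records])
phone_share = top_value_share([last4 for _, last4 in records])
```

In this toy data the zip code values cluster while the phone digits do not, so `zip_share` exceeds `phone_share`, which is consistent with preferring the more evenly distributed element.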

To facilitate discussion, one or more embodiments are described below as employing house number values as contact information data elements used to form the entity identifiers. For illustration, if a data record includes an address of 1234 Main Street for an entity, the “1234” portion of the address is an example of a house number. Use of the term “house number” however is not meant to limit the claimed subject matter to addresses for houses, which are typically unattached single family dwellings. The term “house number” can apply to the corresponding portion of any address data, regardless of the form or type of dwelling, building, or edifice that exists at that location. Furthermore, a house number is but one example of a data element that can be employed consistent with the present application. House-number embodiments are described below only for illustrative purposes and not by way of limitation on the claimed subject matter. Those skilled in the relevant art will appreciate that additional and/or alternative data elements can also be employed consistent with this application and the claimed subject matter.

Continuing with reference to embodiments employing house-number data elements, for purposes of discussion, such embodiments can be implemented to facilitate entity authentication and/or identification with improved accuracy and reliability. This is facilitated, at least in part, by the fact that, within a range of common house number values, the values can be sufficiently evenly distributed among, and/or randomly assigned to, entities. Embodiments can use present and/or historic house number values, as but two examples. In addition to selecting house numbers as the type of data element to use in forming an entity identifier, the quantity, specifications, and/or characteristics of the house number data elements can also be chosen so as to achieve a confidence level that is substantially sufficient and/or tailored for a particular application and/or implementation. Such choices can be made, at least in part, based on the number and/or characteristics of the available data records and/or the entities the data records represent. As but one example, in an implementation using data records from data sources that reflect address histories for entities, an embodiment can combine values from a predetermined quantity of house number data elements associated with an entity to form one or more entity identifiers. For example, in one implementation, having a particular data set and/or grouping of data records, an entity identifier formed from values for two house numbers can be sufficiently unique to identify useful associations. A different implementation can require that values from three or more house numbers are used to form entity identifiers. Other variations are also possible consistent with the claimed subject matter.

Consistent with the claimed subject matter, multiple data elements can be selected based at least in part on having values exhibiting characteristics that make them suitable for combining to form an entity identifier that is substantially statistically unique for a given set of data records and/or represented entities. For example, in one embodiment, the number of possible values for an entity identifier can be approximated as the number of available, distinct values per digit raised to the power of the number of digits composing the entity identifier. Specific data elements can be selected so the number of possible unique values for an entity identifier formed from values for the selected data elements exceeds the number of distinct entities within the population. Such an entity identifier can be considered substantially statistically unique with respect to the population of entities represented by the data records.

The factor by which the number of possible unique entity identifier values exceeds the number of entities can be customized for a desired confidence level in associations made among data records. The greater the factor of excess, the more statistically unique the entity identifier is and the better the quality and specificity of the associations made using that entity identifier. A factor can be predetermined and can be designated, selected, and/or applied for an intended application and/or specific implementation to yield associations having a desired confidence level and/or quality.

For example, in an embodiment having data records representing a population of approximately 300,000,000 entities, data records can be selected so as to form one or more substantially statistically unique entity identifiers for the given population. Additionally, the extent to which the formed entity identifier is statistically unique relative to the applicable population can be customized based, at least in part, on selection of data elements. For example, if data elements are selected such that a formed entity identifier includes nine digits, with each digit possessing ten possible numerical values (0-9), then there are approximately one billion possible values for the entity identifier (10^9). This represents a factor of about 3.33, meaning there are approximately three and one third possible entity identifier values per entity in the population. In an alternative embodiment, data elements can be selected so that a formed entity identifier includes twelve digits, with each digit possessing ten possible numerical values (0-9). In such embodiment, there are approximately one trillion possible entity identifier values (10^12). Because the number of possible entity identifier values exceeds the number of entities within the population by a factor of over 3,333, the corresponding twelve-digit entity identifiers are more substantially statistically unique than were the nine-digit entity identifiers. It should be noted that various data elements can be combined to achieve the results indicated above. For example, two data elements with six-digit values can combine to form an entity identifier that is statistically comparable to an entity identifier formed from three data elements having values of four digits each.

The threshold for quantifying and/or qualifying a predetermined factor for a specific implementation and/or application can be determined based at least in part on a number of applicable considerations, including, without limitation, the extent to which the values for the selected data elements are randomly assigned and/or evenly distributed among the entities in the population, and the application's tolerance for false-positives, to name only a couple of examples. Those skilled in the relevant arts will appreciate that certain requirements, considerations, and/or characteristics of an intended application can require associations to achieve a specific confidence level and an appropriately applicable factor value and/or acceptable range of factor values can be determined accordingly.
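
The arithmetic in the nine-digit and twelve-digit examples can be checked with a short sketch. The function name `uniqueness_factor` is hypothetical, and the calculation assumes every digit position can independently take any of its possible values.

```python
def uniqueness_factor(num_digits, values_per_digit, population):
    """Ratio of possible identifier values to the number of entities.

    Assumes each digit independently takes any of `values_per_digit` values.
    """
    possible_values = values_per_digit ** num_digits
    return possible_values / population

POPULATION = 300_000_000  # population size used in the example above

nine_digit_factor = uniqueness_factor(9, 10, POPULATION)     # 10^9 / 3 * 10^8
twelve_digit_factor = uniqueness_factor(12, 10, POPULATION)  # 10^12 / 3 * 10^8

# Two six-digit elements and three four-digit elements both give twelve digits.
same_size = uniqueness_factor(2 * 6, 10, POPULATION) == uniqueness_factor(3 * 4, 10, POPULATION)
```

Under these assumptions the nine-digit factor is about 3.33 and the twelve-digit factor is a little over 3,333, matching the figures stated above.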

Continuing for illustrative purposes with the example of house number data elements, and as one example of an additional and/or alternative requirement, entity identifiers can be defined to include a predetermined number of digits (e.g., 10, 12, 20, etc.). A sufficient number of house number data elements can be combined to achieve the desired number of digits in the entity identifier. Those skilled in the relevant arts will appreciate that increasing the quantity of house numbers providing values used to create the entity identifier also increases the statistical uniqueness of the formed entity identifier. As additional quantities of house number values are combined to create the entity identifier, it becomes statistically less likely that the same identifier can match multiple data records without the data records either referring to the same entity or entities that can be properly associated with one another (familial relatives, roommates, etc.).
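
One possible way to combine house numbers until a predetermined digit count is reached is sketched below. The function name `build_identifier`, the simple concatenation scheme, and the choice to return `None` when too few digits are available are all illustrative assumptions rather than details taken from the description.

```python
def build_identifier(house_numbers, min_digits):
    """Concatenate house numbers, in order, until at least `min_digits` digits.

    Returns None if the available house numbers cannot reach the target
    length (an assumed convention for this sketch).
    """
    parts = []
    total_digits = 0
    for number in house_numbers:
        text = str(number)
        parts.append(text)
        total_digits += len(text)
        if total_digits >= min_digits:
            return "".join(parts)
    return None

twelve_digit_id = build_identifier([1234, 5678, 9012, 3456], 12)
too_short = build_identifier([900, 725], 10)
```

Here three four-digit house numbers are enough to satisfy a twelve-digit requirement, while two three-digit house numbers cannot satisfy a ten-digit one.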

At least in part by choosing sufficiently restrictive data element requirements and/or characteristics to form the entity identifiers, entity identifiers can be created that are substantially statistically unique (e.g., it can be said with substantially high statistical confidence, sufficient for the intended application and/or implementation, that a specific entity identifier corresponds to either one unique entity or separate unique entities that can be properly associated). For example, in one embodiment, entities in the United States for which sufficient corresponding address data records exist can be identified or associated using three or more house numbers contained in their financial, utility, or other address history records. In such an embodiment, the degree and/or extent of statistical uniqueness of the resulting identifier can be sufficiently and/or substantially high if the three house numbers contain a total of twelve or more digits when combined. This is because, in the United States, for example, a majority of house numbers have values with three or more digits that range between 100 and 20000. By volume, the majority of house numbers have four digits. Therefore, the odds of any two entity data records including the same three house numbers, regardless of sequence, without the represented entities being associated are approximately 1 in (20000−100)^3, or 1 in 7,880,599,000,000. In alternative implementations (for example, in a system for associating entities with addresses outside the United States, etc.), other characteristics can be chosen for the data elements used to form the entity identifier. Data elements can be chosen so as to produce confidence levels that are specifically tailored for the given application, implementation, and/or data records.
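
The odds figure quoted above can be reproduced with a few lines of arithmetic. This sketch assumes, as a simplification, that each of the three house numbers is drawn independently and uniformly from the stated range; the variable names are hypothetical.

```python
# Range of common U.S. house number values given in the text.
LOW, HIGH = 100, 20000

# Approximate count of distinct values per house number (assumed uniform).
values_per_house_number = HIGH - LOW

# Three independent house numbers: cube the per-number count.
one_in = values_per_house_number ** 3
```

The result, `one_in`, equals 7,880,599,000,000, which is the "1 in" figure stated in the example.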

Because of the statistical improbability of two discrete entities randomly sharing a common entity identifier with the above characteristics, embodiments consistent with the claimed subject matter can implement such substantially statistically unique entity identifiers in a wide variety of business applications and/or for other implementations and/or purposes that can require substantially high levels of accuracy in identifying and/or associating entities represented by data records from one or more data sources. FIG. 1 illustrates one example of a system for implementing identification and/or association embodiments consistent with the claimed subject matter. The system of FIG. 1 is presented for illustrative purposes and to facilitate discussion; it is not meant as a limitation on the scope of the attached claims, and those skilled in the relevant art will appreciate that apparatuses or other systems can be provided with fewer, alternative, and/or additional components and/or configurations while remaining consistent with the claimed subject matter.

With specific reference to FIG. 1, a computer system 100 is provided to access data records from one or more data sources 102. Computer system 100 can access data sources 102 directly, or via an optional network connection 104, such as the Internet, an intranet, LAN, WAN, and/or other network. Accordingly, data sources 102 can be maintained locally and/or remotely with respect to the location of computer system 100. Computer system 100 can also include and/or have access to a processing engine 106 capable of executing programming instructions for generating and/or applying one or more entity identifiers to identify and/or associate entities represented by the data records in data sources 102. Results of identification and/or association operations can be further processed and/or applied within computer system 100, or they can be organized for and/or communicated to one or more separate systems for subsequent handling, if or to the extent necessary and/or desirable given the intended functionality and/or specific implementation in which an embodiment operates.

Consistent with the present application, apparatuses or systems, such as the system illustrated in FIG. 1, can implement various identification and/or association processes using substantially statistically unique identifiers as disclosed herein. FIG. 2 presents a process flow diagram including examples of steps that can be included in one embodiment of such a process. In particular, FIG. 2 can facilitate the identification of non-obvious associations between entities. The process of FIG. 2 can include step 200, for establishing access to data records, which can include securing access to data records not previously accessed and/or possessed. The data records can be contained within, maintained by, and/or otherwise made available from one or more separate, discrete data sources. At step 202, data records for each individual entity can be processed to identify entity identifiers that correspond to that entity. As one example illustrating step 202 using house numbers, if an entity's data record has house number values including 900, 725, 1255, and 1221, using combinations of three house numbers, and ignoring order, the following four entity identifiers can be formed: 9007251255, 9007251221, 72512551221, and 90012551221.
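
The house-number example for step 202 can be sketched with Python's standard `itertools.combinations`. The function name `entity_identifiers` and the choice to concatenate the values in the order they appear in the record are assumptions for illustration.

```python
from itertools import combinations

def entity_identifiers(house_numbers, size=3):
    """Form one identifier per unordered combination of `size` house numbers.

    Values are concatenated in the order they appear in the record
    (an assumed convention for this sketch).
    """
    return [
        "".join(str(number) for number in combo)
        for combo in combinations(house_numbers, size)
    ]

identifiers = entity_identifiers([900, 725, 1255, 1221])
```

For the four house numbers in the example, this yields the same four identifiers listed above, although `combinations` emits them in a slightly different sequence.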

At step 204, results of step 202 can be organized and/or grouped. One example grouping embodiment can include grouping results first according to entity identifier, and then according to individual entity corresponding to each entity identifier. Other grouping methodologies can additionally and/or alternatively be employed. Using the results grouped in step 204, step 206 identifies associations among the entities and can initiate and/or facilitate additional processing of one or more of the data records based, at least in part, on the identified associations. Entities that share entity identifiers can be associated. If an entity identifier from the list grouped in step 204 corresponds to two data records, the entities represented by those data records can accordingly also be associated. The records representing those entities are either separate records representing the same entity, or separate records representing different entities that can be properly associated with one another. Context or data within the data records, application, and/or specific data elements can be used to distinguish between the two types of associations. For example, presence of the same substantially universal key, such as a full Social Security Number, in both data records can indicate that the associated records represent the same entity.

FIG. 3 presents a diagram conceptually illustrating the association of multiple entities and/or data records using a common entity identifier. For an embodiment as illustrated in FIG. 3, data elements, such as house numbers, in one or more data records for a first entity data record 300 can form two entity identifiers, illustrated in FIG. 3 as entity identifier 302, and entity identifier 304. Entity data record 306 can also form entity identifier 304. Entity data record 308 forms entity identifier 310. Because entity data record 300 and entity data record 306 share entity identifier 304 in common, they can be associated. However, because entity data record 308 does not share a common entity identifier, it cannot be associated with either entity data record 300 or entity data record 306.
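
The grouping of step 204 and the association of step 206 can be sketched as follows, using the FIG. 3 reference numerals as stand-in keys. The function name `associate` and the dictionary layout are hypothetical; the sketch only shows one way to group records by identifier and pair those that share one.

```python
from collections import defaultdict
from itertools import combinations

def associate(record_identifiers):
    """Group records by entity identifier, then pair records sharing one."""
    records_by_identifier = defaultdict(set)
    for record, identifiers in record_identifiers.items():
        for identifier in identifiers:
            records_by_identifier[identifier].add(record)

    associated_pairs = set()
    for records in records_by_identifier.values():
        # Every pair of records under the same identifier is associated.
        for first, second in combinations(sorted(records), 2):
            associated_pairs.add((first, second))
    return associated_pairs

# FIG. 3 stand-ins: record numeral -> identifier numerals it forms.
fig3 = {300: {302, 304}, 306: {304}, 308: {310}}
pairs = associate(fig3)
```

With this input, only records 300 and 306 are paired, because they share identifier 304; record 308 shares no identifier and remains unassociated.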

FIG. 4 presents an alternative methodology and process flow to that depicted in FIG. 2. FIG. 4 illustrates a process flow diagram for identifying and/or associating entities using, as separate components, individual data elements that collectively can be combined to form an entity identifier. Given an original entity for which the values of data elements in a representative data record are known, an embodiment implementing the process of FIG. 4 can identify additional records corresponding to the same entity, as well as separate entities that can be associated with the original entity. With particular reference to FIG. 4, step 400 begins by identifying and selecting the type and/or quantity of data elements that can be employed so as to yield a substantially statistically unique entity identifier. In step 402, the values for each of those data element components are gathered from an identified data record representing an original entity. In step 404, one or more data sources can be queried to identify data records that include a value matching and/or substantially matching the value for any of the data element components identified in step 402. Separate searches/queries can be executed for each data element value. In step 406, associations can be identified among entities included in the search results. For example, data records with element values matching each of the search queries either represent the original entity, or entities that can be associated with the original entity with substantial statistical reliability.

For example, in one embodiment consistent with the claimed subject matter, presented for illustrative purposes, and not by way of limitation, if a substantially statistically unique entity identifier can be formed for a given set of data records using three house numbers, separate searches can be conducted using each house number as a query. Entities represented by data records that appear three times in the search results list, indicating that the data record included a match for each separate house number value searched, represent the original entity and/or associated entities (e.g., related entities sharing a common address history, etc.).
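
The three-search example can be sketched by running one search per house number and counting how many result lists each record appears in. The function name `find_associated`, the in-memory dictionary standing in for the data sources, and the record labels are assumptions for illustration; the sketch also assumes the query values are distinct.

```python
from collections import Counter

def find_associated(records, query_values):
    """Run one search per query value and keep records hit by every search.

    `records` maps a record id to the set of house numbers it contains.
    Assumes `query_values` holds distinct values.
    """
    hit_counts = Counter()
    for value in query_values:              # one separate search per value
        for record_id, values in records.items():
            if value in values:
                hit_counts[record_id] += 1
    return {rid for rid, hits in hit_counts.items() if hits == len(query_values)}

# Hypothetical data source contents.
records = {
    "A": {900, 725, 1255},
    "B": {900, 725, 1255, 42},
    "C": {900, 725, 18},
}
matches = find_associated(records, [900, 725, 1255])
```

Records "A" and "B" appear in all three result lists and are returned; record "C" appears in only two and is excluded.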

For efficiency or process optimization purposes, the searching procedure can also employ filtering logic to substantially reduce processing requirements when executing searches. Rather than searching all data records in all data sources for matches based on each search criterion, searching on the second criterion can be limited to those entities returned as results of a search on the first criterion. Similarly, a search performed using the third criterion can be limited to the results of the second search, and so on.
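
The filtering logic can be sketched as a progressive narrowing of a candidate set, where each search examines only the survivors of the previous one. The function name `filtered_search` and the in-memory dictionary standing in for the data sources are hypothetical.

```python
def filtered_search(records, criteria):
    """Apply each criterion only to records that passed the previous one."""
    candidates = set(records)  # the first search considers every record
    for value in criteria:
        candidates = {rid for rid in candidates if value in records[rid]}
        if not candidates:
            break              # nothing left to search; stop early
    return candidates

# Hypothetical data source contents: record id -> house numbers it contains.
records = {
    "A": {900, 725, 1255},
    "B": {900, 725, 1255, 42},
    "C": {900, 725, 18},
    "D": {311, 4020, 77},
}
survivors = filtered_search(records, [900, 725, 1255])
```

Record "D" is dropped by the first criterion and is never examined again, and record "C" is dropped by the third, leaving "A" and "B".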

It should also be noted that, consistent with the claimed subject matter, entity identifiers corresponding to two or more data records do not have to represent exact matches in order for the data records and/or the corresponding entities to be associated. Based, at least in part, on factors such as the tolerance for false positive associations in a given application and/or implementation, a certain acceptable margin of error can be allowed for purposes of identifying matches between entity identifiers or selected data element component values. The use of the phrases “substantially matching,” “substantial match,” or the like in this application and the attached claims is meant to indicate matches that are either exact, or within a predetermined acceptable margin of error for a given implementation and/or application. For example, in an embodiment using an entity identifier formed from multiple data element values, the order in which the data values appear in two or more entity identifiers can be ignored for purposes of comparing entity identifiers and associating corresponding data records. An alternative embodiment can elect to ignore duplicate values in the formation and/or comparison of entity identifiers. Still other embodiments can allow for other variances from exact matching. Such variances can include rounding conventions for numeric data values, and/or common synonyms, abbreviations, and/or alternative spellings for alphanumeric data values, to illustrate but a few examples.

It will be obvious to those having skill in the art that many changes may be made to the details of the above-described embodiments without departing from the underlying principles of the invention. The scope of the present invention should, therefore, be determined only by the following claims.
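
Two of the tolerances mentioned above, ignoring order and ignoring duplicate values, can be sketched as a normalization step applied before comparison. The function names `normalize` and `substantially_match` and the option flags are hypothetical, and other margins of error (rounding, synonyms, alternative spellings) are not covered by this sketch.

```python
def normalize(values, ignore_order=True, ignore_duplicates=True):
    """Reduce a list of data element values to a comparable canonical form."""
    items = [str(value) for value in values]
    if ignore_duplicates:
        items = list(dict.fromkeys(items))  # drop repeats, keep first occurrence
    if ignore_order:
        items = sorted(items)
    return tuple(items)

def substantially_match(first, second, **options):
    """Compare two sets of element values under the chosen tolerances."""
    return normalize(first, **options) == normalize(second, **options)

order_ignored = substantially_match([900, 725, 1255], [1255, 900, 725])
order_enforced = substantially_match([900, 725, 1255], [1255, 900, 725], ignore_order=False)
duplicates_ignored = substantially_match([900, 900, 725], [725, 900])
```

Comparing the component values rather than a concatenated string also avoids treating different value sets that happen to concatenate to the same digits as a match.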

1. A method for associating data records representing one or more entities, comprising: obtaining access to a plurality of data records, each data record including a plurality of data elements; selecting two or more data elements from the plurality of data elements, the selected data elements being selected so as to enable one or more entity identifiers to be formed from values for the selected data elements from the plurality of data records; and associating a first data record with a second data record if a first entity identifier formed from values for the selected data elements from the first data record substantially matches a second entity identifier formed from values for the selected data elements from the second data record.

2. The method of claim 1 wherein the data elements are selected so that the formed one or more entity identifiers are substantially statistically unique.

3. The method of claim 2 wherein the data elements are selected so that the formed one or more entity identifiers have a number of possible values in excess of a number of entities represented by the plurality of data records.

4. The method of claim 3 wherein the selected data elements are selected so that the number of possible values for the formed one or more entity identifiers exceeds the number of entities by at least a predetermined factor.

5. The method of claim 4 wherein the predetermined factor is determined based at least in part on a source for the plurality of data records.

6. The method of claim 4 wherein the predetermined factor is determined at least in part according to an intended purpose for associating the first data record with the second data record.

7. The method of claim 4 wherein the predetermined factor is determined so that the associating of the first data record with the second data record achieves at least a predetermined confidence level.

8. The method of claim 1 wherein the selected data elements encompass values that are substantially randomly assigned to entities represented by the plurality of data records.

9. The method of claim 1 wherein the selected data elements encompass values that are substantially evenly distributed among entities represented by the plurality of data records.

10. The method of claim 1, further comprising defining a criterion for the one or more entity identifiers, wherein the data elements are selected so that the one or more entity identifiers are formed to satisfy the criterion.